Know Your Limits: A Survey of Abstention in Large Language Models

arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated generalization capabilities across NLP tasks such as question answering (QA) (Wei et al., 2022; Chowdhery et al., 2022), abstractive summarization (Zhang et al., 2023a), and dialogue generation (Yi et al., 2024). But these models are also unreliable, having a tendency to "hallucinate" false information in their responses (Ji et al., 2023b), generate overly certain or authoritative responses (Zhou et al., 2024b), answer with incomplete information (Zhou et al., 2023b), or produce harmful or dangerous responses (Anwar et al., 2024). In these situations, the model should ideally abstain.

But questions of human values and the answerability of the query itself are difficult to model in terms of model confidence (Yang et al., 2023). While prior work demonstrates the potential of abstention in enhancing model safety and reliability (Varshney et al., 2023; Wang et al., 2024c; Zhang et al., 2024a), the study of abstention has also been constrained to specific QA tasks. This task-specific approach limits the broader applicability of abstention strategies across the diverse range of scenarios encountered by general-purpose chatbots engaging in open-domain interactions.


The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development

arXiv.org Artificial Intelligence

Systematic literature reviews are the highest quality of evidence in research. However, the review process is hindered by significant resource and data constraints. The Literature Review Network (LRN) is the first-of-its-kind explainable AI platform adhering to PRISMA 2020 standards, designed to automate the entire literature review process. LRN was evaluated in the domain of surgical glove practices using 3 search strings developed by experts to query PubMed. A non-expert trained all LRN models. Performance was benchmarked against an expert manual review. Explainability and performance metrics assessed LRN's ability to replicate the experts' review. Concordance was measured with the Jaccard index and confusion matrices. Researchers were blinded to each other's results until study completion. Overlapping studies were integrated into an LRN-generated systematic review. LRN models demonstrated superior classification accuracy without expert training, achieving 84.78% and 85.71% accuracy. The highest-performing model achieved high interrater reliability (k = 0.4953) and explainability metrics, linking 'reduce', 'accident', and 'sharp' with 'double-gloving'. Another LRN model covered 91.51% of the relevant literature despite diverging from the non-expert's judgments (k = 0.2174), with the terms 'latex', 'double' (gloves), and 'indication'. LRN outperformed the manual review (19,920 minutes over 11 months), reducing the entire process to 288.6 minutes over 5 days. This study demonstrates that explainable AI does not require expert training to successfully conduct PRISMA-compliant systematic literature reviews like an expert. LRN summarized the results of surgical glove studies and identified themes that were nearly identical to the clinical researchers' findings. Explainable AI can accurately expedite our understanding of clinical practices, potentially revolutionizing healthcare research.
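
The abstract names the Jaccard index and reports interrater agreement values (k) without showing how such quantities are computed. The following is a minimal sketch of the standard definitions of the Jaccard index and Cohen's kappa, assuming two reviewers make include/exclude decisions over the same studies. That k in the abstract denotes Cohen's kappa is itself an assumption, and the study IDs and labels are invented for illustration; they are not data from the LRN study.

```python
def jaccard_index(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of study IDs."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(union)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical decisions over six studies (1 = include, 0 = exclude).
studies = ["s1", "s2", "s3", "s4", "s5", "s6"]
expert = [1, 1, 1, 1, 0, 0]
model = [0, 1, 1, 1, 1, 0]

expert_included = [s for s, y in zip(studies, expert) if y]
model_included = [s for s, y in zip(studies, model) if y]

overlap = jaccard_index(expert_included, model_included)
kappa = cohen_kappa(expert, model)
print(f"Jaccard = {overlap:.2f}, kappa = {kappa:.2f}")
```

Kappa discounts the agreement expected by chance, so two reviewers can overlap substantially on included studies and still obtain a modest kappa, as in this toy example.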


TrustUQA: A Trustful Framework for Unified Structured Data Question Answering

arXiv.org Artificial Intelligence

Natural language question answering (QA) over structured data sources such as tables and knowledge graphs (KGs) has been widely investigated, for example with Large Language Models (LLMs). The main solutions include question-to-formal-query parsing and retrieval-based answer generation. However, current methods of the former often suffer from weak generalization, failing to deal with multiple sources simultaneously, while the latter is limited in trustfulness. In this paper, we propose UnifiedTQA, a trustful QA framework that can simultaneously support multiple types of structured data in a unified way. To this end, it adopts an LLM-friendly and unified knowledge representation method called Condition Graph (CG), and uses an LLM and demonstration-based two-level method for CG querying. For enhancement, it is also equipped with dynamic demonstration retrieval. We have evaluated UnifiedTQA with 5 benchmarks covering 3 types of structured data. It outperforms 2 existing unified structured data QA methods, and in comparison with the baselines that are specific to a data type, it achieves state-of-the-art on 2 of them. Furthermore, we demonstrate the potential of our method for more general QA tasks: QA over mixed structured data and QA across structured data.


ERA-CoT: Improving Chain-of-Thought through Entity Relationship Analysis

arXiv.org Artificial Intelligence

Large language models (LLMs) have achieved commendable accomplishments in various natural language processing tasks. However, LLMs still encounter significant challenges when dealing with complex scenarios involving multiple entities. These challenges arise from the presence of implicit relationships that demand multi-step reasoning. In this paper, we propose a novel approach, ERA-CoT, which aids LLMs in understanding context by capturing relationships between entities and supports the reasoning of diverse tasks through Chain-of-Thoughts (CoT). Experimental results show that ERA-CoT outperforms current CoT prompting methods, achieving an average improvement of 5.1% on GPT3.5 over previous SOTA baselines. Our analysis indicates that ERA-CoT increases the LLM's understanding of entity relationships, significantly improves the accuracy of question answering, and enhances the reasoning ability of LLMs.


Characterizing LLM Abstention Behavior in Science QA with Context Perturbations

arXiv.org Artificial Intelligence

The correct model response in the face of uncertainty is to abstain from answering a question so as not to mislead the user. In this work, we study the ability of LLMs to abstain from answering context-dependent science questions when provided insufficient or incorrect context. We probe model sensitivity in several settings: removing gold context, replacing gold context with irrelevant context, and providing additional context beyond what is given. In experiments on four QA datasets with four LLMs, we show that performance varies greatly across models, across the type of context provided, and also by question type; in particular, many LLMs seem unable to abstain from answering boolean questions using standard QA prompts. Our analysis also highlights the unexpected impact of abstention performance on QA task accuracy. Counter-intuitively, in some settings, replacing gold context with irrelevant context or adding irrelevant context to gold context can improve abstention performance in a way that results in improvements in task performance. Our results imply that changes are needed in QA dataset design and evaluation to more effectively assess the correctness and downstream impacts of model abstention.
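
The abstract compares abstention behavior across context settings but does not say how abstention is detected or scored. As a rough sketch only, the snippet below computes an abstention rate per context setting from (setting, response) pairs. The marker strings, setting names, and responses are all hypothetical, and keyword matching is an assumed stand-in for whatever detection method the authors actually use.

```python
from collections import defaultdict

# Hypothetical phrases treated as abstentions; not taken from the paper.
ABSTAIN_MARKERS = ("unanswerable", "i don't know", "cannot answer")


def is_abstention(response):
    """Return True if the response contains any assumed abstention marker."""
    text = response.strip().lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def abstention_rate_by_setting(records):
    """Map each context setting to the fraction of responses that abstain."""
    counts = defaultdict(lambda: [0, 0])  # setting -> [abstained, total]
    for setting, response in records:
        counts[setting][0] += is_abstention(response)
        counts[setting][1] += 1
    return {s: abstained / total for s, (abstained, total) in counts.items()}


# Invented model outputs under three context settings.
records = [
    ("gold_context", "Yes"),
    ("gold_context", "Unanswerable"),
    ("no_context", "I don't know."),
    ("no_context", "No"),
    ("irrelevant_context", "Cannot answer from the given context."),
    ("irrelevant_context", "Unanswerable"),
]

rates = abstention_rate_by_setting(records)
for setting, rate in rates.items():
    print(f"{setting}: {rate:.2f}")
```

In settings where the gold context has been removed or replaced, a higher rate is the desired behavior; with gold context present, abstentions count against the model.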


Bridging the Preference Gap between Retrievers and LLMs

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated superior results across a wide range of tasks, while retrieval has long been established as an effective means of obtaining task-relevant information for humans. Retrieval-augmented Generation (RAG) methods are known for their effectiveness in knowledge-intensive tasks by locating relevant information and placing it within the context window of the LLM. However, the relationship between retrievers and LLMs is still under-investigated. Most existing work treats the retriever and the LLM as independent components and leaves a gap between retrieving human-friendly information and assembling an LLM-friendly context. In this work, we examine a novel bridge model, validate the ranking and selection assumptions in retrievers in the context of RAG, and propose a training framework that chains together supervised and reinforcement learning to learn a bridge model. Empirical results demonstrate the effectiveness of our method in both question-answering and personalized generation tasks.


Towards Environmentally Equitable AI via Geographical Load Balancing

arXiv.org Artificial Intelligence

Fueled by the soaring popularity of large language and foundation models, the accelerated growth of artificial intelligence (AI) models' enormous environmental footprint has come under increased scrutiny. While many approaches have been proposed to make AI more energy-efficient and environmentally friendly, environmental inequity -- the fact that AI's environmental footprint can be disproportionately higher in certain regions than in others -- has emerged, raising social-ecological justice concerns. This paper takes a first step toward addressing AI's environmental inequity by balancing its regional negative environmental impact. Concretely, we focus on the carbon and water footprints of AI model inference and propose equity-aware geographical load balancing (GLB) to explicitly address AI's environmental impacts on the most disadvantaged regions. We run trace-based simulations by considering a set of 10 geographically-distributed data centers that serve inference requests for a large language AI model. The results demonstrate that existing GLB approaches may amplify environmental inequity while our proposed equity-aware GLB can significantly reduce the regional disparity in terms of carbon and water footprints.
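
The abstract does not give the equity-aware GLB formulation. One plausible reading, offered here purely as an assumption, is that it reduces the worst regional footprint rather than the total. The sketch below contrasts a total-minimizing assignment with a greedy min-max heuristic. The function names, footprint intensities, and capacities are hypothetical, and a single footprint value per request stands in for the separate carbon and water footprints.

```python
def min_total_assignment(demand, intensity, capacity):
    """Fill the lowest-intensity data centers first (minimizes total footprint)."""
    load = [0] * len(intensity)
    for i in sorted(range(len(intensity)), key=lambda i: intensity[i]):
        load[i] = min(capacity[i], demand)
        demand -= load[i]
    if demand > 0:
        raise ValueError("demand exceeds total capacity")
    return load


def min_max_assignment(demand, intensity, capacity):
    """Greedy heuristic: send each request where the resulting regional
    footprint is smallest, which keeps the worst-off region's footprint low."""
    load = [0] * len(intensity)
    for _ in range(demand):
        candidates = [i for i in range(len(load)) if load[i] < capacity[i]]
        if not candidates:
            raise ValueError("demand exceeds total capacity")
        best = min(candidates, key=lambda i: (load[i] + 1) * intensity[i])
        load[best] += 1
    return load


def regional_footprints(load, intensity):
    """Footprint per region = requests served * footprint per request."""
    return [l * k for l, k in zip(load, intensity)]


# Hypothetical setup: three data centers, footprint units per request.
intensity = [1.0, 2.0, 3.0]
capacity = [6, 6, 6]
demand = 6

baseline = min_total_assignment(demand, intensity, capacity)
equitable = min_max_assignment(demand, intensity, capacity)

baseline_fp = regional_footprints(baseline, intensity)
equitable_fp = regional_footprints(equitable, intensity)
print("baseline :", baseline, "max =", max(baseline_fp), "total =", sum(baseline_fp))
print("equitable:", equitable, "max =", max(equitable_fp), "total =", sum(equitable_fp))
```

In this toy setup the min-max heuristic lowers the largest regional footprint at the cost of a higher total, which illustrates the trade-off an equity-aware objective has to manage.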


Objaverse: A Universe of Annotated 3D Objects

arXiv.org Artificial Intelligence

Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. Large neural models trained on such datasets produce impressive results and top many of today's benchmarks. A notable omission within this family of large-scale datasets is 3D data. Despite considerable interest and potential applications in 3D vision, datasets of high-fidelity 3D models continue to be mid-sized with limited diversity of object categories. Addressing this gap, we present Objaverse 1.0, a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse improves upon present day 3D repositories in terms of scale, number of categories, and in the visual diversity of instances within a category. We demonstrate the large potential of Objaverse via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models. Objaverse can open new directions for research and enable new applications across the field of AI.


VALUE: Understanding Dialect Disparity in NLU

arXiv.org Artificial Intelligence

English Natural Language Understanding (NLU) systems have achieved strong performance and have even outperformed humans on benchmarks like GLUE and SuperGLUE. However, these benchmarks contain only textbook Standard American English (SAE). Other dialects have been largely overlooked in the NLP community. This leads to biased and inequitable NLU systems that serve only a sub-population of speakers. To understand disparities in current models and to facilitate more dialect-competent NLU systems, we introduce the VernAcular Language Understanding Evaluation (VALUE) benchmark, a challenging variant of GLUE that we created with a set of lexical and morphosyntactic transformation rules. In this initial release (V.1), we construct rules for 11 features of African American Vernacular English (AAVE), and we recruit fluent AAVE speakers to validate each feature transformation via linguistic acceptability judgments in a participatory design manner. Experiments show that these new dialectal features can lead to a drop in model performance. To run the transformation code and download both synthetic and gold-standard dialectal GLUE benchmarks, see https://github.com/SALT-NLP/value